Provable Convex Co-clustering of Tensors

Cluster analysis is a fundamental tool for pattern discovery of complex heterogeneous data. Prevalent clustering methods mainly focus on vector or matrix-variate data and are not applicable to general-order tensors, which arise frequently in modern scientific and business applications. Moreover, there is a gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of their non-convex formulations. In this work, we bridge this gap by developing a provable convex formulation of tensor co-clustering. Our convex co-clustering (CoCo) estimator enjoys stability guarantees and its computational and storage costs are polynomial in the size of the data. We further establish a non-asymptotic error bound for the CoCo estimator, which reveals a surprising “blessing of dimensionality” phenomenon that does not exist in vector or matrix-variate cluster analysis. Our theoretical findings are supported by extensive simulated studies. Finally, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to improve advertising effectiveness.


Introduction
In this work, we study the problem of finding structure in multiway data, or tensors, via clustering. Tensors appear frequently in modern scientific and business applications involving complex heterogeneous data. For example, data in a neurogenomics study of brain development consists of a 3-way array of expression level measurements indexed by gene, space, and time . Other examples of 3-way data arrays consisting of matrices collected over time include email communications (sender, recipient, time) (Papalexakis et al., 2013), online chatroom communications (user, keyword, time) (Acar et al., 2006), bike rentals (source station, destination station, time) (Guigourès et al., 2015), and internet network traffic (source IP, destination IP, time) (Sun et al., 2006). The rise in tensor data has created new challenges in making predictions, such as in recommender systems for example (Zheng et al., 2016;Symeonidis, 2016;Symeonidis and Zioupos, 2016;Frolov and Oseledets, 2017;Bi et al., 2018) as well as inferring latent structure in multiway data (Acar and Yener, 2009;Anandkumar et al., 2014;Cichocki et al., 2015;Sidiropoulos et al., 2017).
As tensors become increasingly more common, the need for a reliable co-clustering method grows increasingly more urgent. Prevalent clustering methods, however, mainly focus on vector or matrix-variate data. The goal of vector clustering is to identify subgroups within the vector-variate observations (Ma and Zhong, 2008;Shen and Huang, 2010;Shen et al., 2012;Wang et al., 2013). Biclustering is the extension of clustering to two-way data where both the observations (rows) and the features (columns) of a data matrix are simultaneously grouped together (Hartigan, 1972;Madeira and Oliveira, 2004;Busygin et al., 2008). In spite of their prevalence, these approaches are not directly applicable to the cluster analysis of general-order (general-way) tensors. On the other hand, existing methods for co-clustering general D-way arrays, for D ≥ 3, employ one of three strategies: (i) extensions of spectral clustering to tensors (Wu et al., 2016b), (ii) directly clustering the subarrays along each dimension, or way, of the tensor using either k-means or variants on it (Jegelka et al., 2009), and (iii) low rank tensor decompositions (Sun et al., 2009;Papalexakis et al., 2013;Zhao et al., 2016). While all these existing approaches may demonstrate good empirical performance, they have limitations. For instance, the spectral co-clustering method proposed by Wu et al. (2016b) is limited to nonnegative tensors and the CoTeC method proposed by Jegelka et al. (2009), like k-means, requires specifying the number of clusters along each dimension as a tuning parameter. Most importantly, none of the existing methods provide statistical guarantees for recovering an underlying co-clustering structure. There is a conspicuous gap between statistical guarantees and computational efficiency for existing tensor clustering solutions due to the nature of the non-convex formulations of the previously mentioned works.
which reveals a surprising "blessing of dimensionality" phenomenon: As the dimensions of the array increase, the CoCo estimator is still consistent even if the number of underlying co-clusters grows as a function of the number of elements in the tensor sample. More importantly, an underlying co-clustering structure can be consistently recovered with even a single tensor sample, which is a typical case in real applications. This phenomenon does not exist in vector or matrix-variate cluster analysis.

ii.
The CoCo estimator possesses stability guarantees. In particular, the CoCo estimatoris Lipschitz continuous in the data and jointly continuous in the data and its tuning parameter. We emphasize that Lipschitz continuity in the data guarantees that perturbations in the data lead to graceful and commensurate variations in the cluster assignments, and the continuity in the tuning parameter can be leveraged to expedite computation through warm starts. iii.
The CoCo estimator can be iteratively computed with convergence guarantees via an accelerated first order method with storage and per-iteration cost that is linear in the size of the data.
In short, the CoCo estimator comes with (i) statistical guarantees, (ii) practically relevant stability guarantees at all sample sizes, and (iii) an algorithm with polynomial complexity. The theoretical properties of our CoCo estimator are supported by extensive simulation studies. To demonstrate its business impact, we apply the CoCo estimator to the cluster analysis of advertisement click tensor data from a major online company. Our clustering results provide meaningful business insights to help advertising planning.
Our work is related to, but also clearly distinct from, a number of recent developments in cluster analysis. The first related line of research tackles convex clustering (Hocking et al., 2011;Zhu et al., 2014;Chi and Lange, 2015;Chen et al., 2015;Tan and Witten, 2015;Wang et al., 2018;Radchenko and Mukherjee, 2017) and convex biclustering (Chi et al., 2017). These existing methods are not directly applicable to general-order tensors, however. Importantly, our CoCo estimator enjoys a unique "blessing of dimensionality" phenomenon that has not been established in the aforementioned approaches. Moreover, the CoCo estimator is similar in spirit to a recent series of work approximating a noisy observed array with an array that is smooth with respect to some latent organization associated with each dimension of the array (Gavish and Coifman, 2012;Ankenman, 2014;Mishne et al., 2016;Yair et al., 2017). Our proposed CoCo procedure seeks an approximating array that is smooth with respect to a latent clustering along each dimension of the array. While CoCo shares features with these array approximation techniques, namely the use of data-driven similarity graphs along tensor modes, a key distinction between our CoCo estimator and these methods is that CoCo produces an approximating array that explicitly recovers hard co-clustering assignments. As we will see shortly, focusing our attention in this work on the co-clustering model paves the way to the discovery and explicit characterization of new and interesting fundamental behavior in finding intrinsic organization within tensors.
the co-clustering problem. In Section 4, we establish the stability properties and prediction error bounds of the CoCo estimator. In Section 5, we describe the algorithm used to compute the CoCo estimator. In Section 6, we discuss how to specify weights used in our CoCo estimator, and in Section 7 we give guidance on how to set and select tuning parameters used in the CoCo estimator in practice. In Section 8, we present simulation results. In Section 9, we discuss the results of applying the CoCo estimator to co-cluster a real data tensor from online advertising. In Section 10, we close with a discussion. The Appendix contains a brief review of the two main tensor decompositions that are discussed in this paper, all technical proofs, as well as additional experiments.

Notation
We adopt the terminology and notation used by Kolda and Bader (2009). We call the number of ways or modes of a tensor its order. Vectors are tensors of order one and denoted by boldface lowercase letters, e.g. a. Matrices are tensors of order two and denoted by boldface capital letters, e.g. A. Tensors of higher-order, namely order three and greater, we denote by boldface Euler script letters, e.g. A. Thus, if A represents a D-way data array of size n 1 × n 2 × ··· × n D , we say A is a tensor of order D. We denote scalars by lowercase letters, e.g. a.
We denote the ith element of a vector a by a i , the ijth element of a matrix A by a ij , the ijkth element of a third-order tensor A by a ijk , and so on.
We can extract a subarray of a tensor by fixing a subset of its indices. For example, by fixing the first index of a matrix to be i, we extract the ith row of the matrix, and by fixing the second index of a matrix to be j, we extract a jth column of the matrix. We use a colon to indicate all elements of a mode. Consequently, we denote the ith row of a matrix A by A i: and the jth column of a matrix A by A :j . Fibers are the subarrays of a tensor obtained by fixing all but one of its indices. In the case of a matrix, a mode-1 fiber is a matrix column and a mode-2 fiber is a matrix row. Slices are the two-dimensional subarrays of a tensor obtained by fixing all but two indices. For example, a third-order tensor A has three sets of slices denoted by A i: : , A : j: , and A : : k .

Basic Tensor Operations
It is often convenient to reorder the elements of a D-way array into a matrix or a vector. Reordering a tensor's elements into a matrix is referred to as matricization, while reordering its elements into a vector is referred to as vectorization. There are many ways to reorder a tensor into a matrix or vector. In this paper, we use a canonical mode-d matricization, where the mode-d fibers of a D-way tensor A ∈ ℝ n 1 × n 2 × ⋯ × n D become the columns of a matrix A (d) ∈ ℝ n d × n −d , where n −d = ∏ j ≠ d n j . Recall that the column-major vectorization of a matrix maps a matrix A ∈ ℝ p × q to the vector a ∈ ℝ pq by stacking the columns of A on top of each other, namely a = A : 1 number of elements in A. As a shorthand, when the context leaves no ambiguity, we denote this vectorization of a tensor A by its boldface lowercase version a.
The Frobenius norm of a D-way tensor A ∈ ℝ n 1 × n 2 × ⋯ × n D is the natural generalization of the Frobenius norm of a matrix, namely it is the square root of the sum of the squares of all its elements, The Frobenius norm of a tensor is equivalent to the ℓ 2 -norm of the vectorization of the tensor, Let A be a tensor in ℝ n 1 × n 2 × ⋯ × n D and B be a matrix in ℝ m × n d . The d-mode (matrix) product of the tensor A with the matrix B, denoted by A × d B, is the tensor of size n 1 × ⋯ × n d−1 × m × n d+1 × ⋯ × n D whose (i 1 , i 2 , ⋯ , i d−1 , j, i d+1 , ⋯ ,i D )th element is given by where I p is the p-by-p identity matrix and ⊗ denotes the Kronecker product between two matrices. The identity given in (1) generalizes the well known formula for the column-major vectorization of a product of two matrices, namely vec(BA) = (I ⊗ B)a.

A Convex Formulation of Co-clustering
We first consider a convex formulation of co-clustering problem when the data is a 3-way tensor X ∈ ℝ n 1 × n 2 × n 3 before discussing the natural generalization to D-way tensors. Our basic assumption is that the observed data tensor is a noisy realization of an underlying tensor that exhibits a checkerbox structure modulo some unknown reordering along each of its modes. Specifically suppose that there are k 1 , k 2 , and k 3 clusters along modes 1, 2, and 3 respectively. If the (i 1 , i 2 , i 3 )-th entry in X belongs to the cluster defined by the r 1 th mode-1 group, r 2 th mode-2 group, and r 3 th mode-3 group, then we assume that the observed tensor element x i 1 i 2 i 3 is given by x i 1 i 2 i 3 = c * r 1 r 2 r 3 + ϵ i 1 i 2 i 3 , joint distribution on the noise terms later in Section 4.2 in order to derive prediction bounds. Thus, we model the observed tensor X as the sum of a mean tensor U* ∈ ℝ n 1 × n 2 × n 3 , whose elements are expanded from the co-cluster means tensor C* ∈ ℝ k 1 × k 2 × k 3 , and a noise tensor E ∈ ℝ n 1 × n 2 × n 3 . We can write this expansion explicitly by introducing a membership matrix M d ∈ 0, 1 n d × k d for the dth mode, where the ikth element of M d is one if and only if the ith mode-d slice belongs to the kth mode-d cluster for k ∈ {1, … , k d }. We require that each row of the membership matrix sum to one, namely M d 1 = 1, to ensure that each of the mode-d slices belongs to exactly one of the k d mode-d clusters. Then, U* = C* × 1 M 1 × 2 M 2 × 3 M 3 . Figure 1 illustrates an underlying mean tensor U* after permuting the slices along each of the modes to reveal a checkerbox structure.
The co-clustering model in (2) is the 3-way analogue of the checkerboard mean model often employed in biclustering data matrices (Madeira and Oliveira, 2004;Tan and Witten, 2014;Chi et al., 2017). Moreover, the tensor C* of co-cluster means corresponds to the tensor of cluster "centers" in the tensor clustering work by Jegelka et al. (2009). The model is complete and exclusive in that each tensor element is assigned to exactly one co-cluster. This is in contrast to models that allow potentially overlapping co-clusters (Lazzeroni and Owen, 2002;Bergmann et al., 2003;Turner et al., 2005;Huang et al., 2008;Witten et al., 2009;Lee et al., 2010;Sill et al., 2011;Bhar et al., 2015).
Estimating the model in (2) consists of finding (i) the partitions along each mode and (ii) the mean values of each of the k 1 k 2 k 3 co-clusters. Estimating c* r 1 r 2 r 3 , given the mode clustering assignments is trivial. Let G 1 , G 2 and G 3 denote the indices of the r 1 th mode-1, r 2 th mode-2, and r 3 th mode-3 groups respectively. If the noise terms ϵ i 1 i 2 i 3 are iid N(0, σ 2 ) for some positive σ 2 , then the maximum likelihood estimate of c* r 1 r 2 r 3 is simply the sample mean of the entries of X over the indices defined by G 1 , G 2 , and G 3 , namely c r 1 r 2 r 3 * = 1 |G 1 ||G 2 | | |G 3 | ∑ i 1 ∈ G 1 ∑ i 2 ∈ G 2 ∑ i 3 ∈ G 3 x i 1 i 2 i 3 .
Finding the partitions G 1 , G 2 , and G 3 , on the other hand, is a combinatorially hard problem. In recent years, however, many combinatorially hard problems, that initially appear computationally intractable, have been successfully attacked by solving a convex relaxation to the original combinatorial optimization problem. Perhaps the most celebrated convex relaxations is the lasso (Tibshirani, 1996), which simultaneously performs variable selection and parameter estimation for fitting sparse regression models by minimizing a non-smooth convex criterion.
In light of the lasso's success, we propose to simultaneously identify partitions along the modes of X and estimate the co-cluster means by minimizing the following convex objective function where R 1 (U) = ∑ i < j w 1, ij ‖U i: : − U j: : ‖ F R 2 (U) = ∑ i < j w 2, ij ‖U : i: − U : j: ‖ F R 3 (U) = ∑ i < j w 3, ij ‖U : : i − U : : j ‖ F .
By seeking the minimizer U γ ∈ ℝ n 1 × n 2 × n 3 of (3), we have cast co-clustering as a signal approximation problem, modeled as a penalized regression, to estimate the true co-cluster means tensor U*. In the following discussion, we drop the dependence of γ in U γ and denote our estimator as U when there is no confusion. The quadratic term in (3) quantifies how well U approximates X, while the regularization term R(U) in (3) penalizes deviations away from a checkerbox pattern. The nonnegative parameter γ tunes the relative emphasis on these two terms. The parameters w d,ij are nonnegative weights whose purpose will be discussed shortly.
To appreciate how the regularization term R(U) steers the minimizer of (3) towards a checkerbox pattern, consider the effect of one of the terms R d (U) in isolation. Specifically, suppose that R(U) = R 1 (U). When γ is zero, the minimum of (3) is attained when U = X.
Or stated another way, U i: : = X i: : for i ∈ {1, … , n 1 }. As γ increases, the mode-1 slices U i: : will shrink towards each other and in fact coalesce due to the non-differentiability of the Frobenius norm at zero. In other words, as γ gets larger, the pairwise differences of the mode-1 slices of U will become increasingly sparser. Sparsity in these pairwise differences leads to a natural partitioning assignment. Two mode-1 slices X i: : and X j: : are assigned to the same mode-1 partition if U i: : = U j: : . Under mild regularity conditions, that we will spell out in Section 4, for sufficiently large γ, all mode-1 slices U will be identical and therefore belong to a single cluster. Similar behavior holds if R(U) = R 2 (U) or When R(U) includes all three terms R d (U) for d = 1, 2, 3, pairs of mode-1, mode-2, and mode-3 slices are simultaneously shrunk towards each other and coalesce as the parameter γ increases. By coupling clustering along each of the modes simultaneously, our formulation explicitly seeks out a solution with a checkerbox mean structure. Moreover, we will show in Section 4 that the solution U produces an entire solution path of checkerbox co-clustering estimates that varies continuously in γ. The solution path spans a range of models from the least smoothed model, where U is X and each tensor element occupies its own co-cluster, to the most smoothed model, where all the elements of U are identical and all tensor elements belong to a single co-cluster.
The nonnegative weights w d,ij fine tune the shrinkage of the slices along the dth mode. For example, if w 1, ij > w 1, i′j′ , then there will be more pressure for U i: : and U j: : to fuse than for U i′: : and U j′: : to fuse as γ increases. Thus, the weight w d,ij quantifies the similarity between the ith and jth mode-d slices. A very large w d,ij indicates that the two slices are very similar, while a very small w d,ij indicates that they are very dissimilar. These pairwise similarities motivate a graphical view of clustering. For the dth mode, define the set E d as the edge set of a similarity graph. Each slice is a node in the graph and the set E d contains an edge (i, j) if and only if w d,ij > 0. Figure 2 shows an example of a mode-1 similarity graph, which corresponds to a tensor with seven mode-1 slices and positive weights that define the edge set E 1 = (1, 2), (2, 3), (4, 5), (4, 6), (6, 7) .
Given the connectivity of the graph, as γ increases, the slices U 1: : , U 2: : , and U 3: : will be shrunk towards each other while the slices U 4: : , U 5: : , U 6: : and U 7: : shrunk towards each other. Since w d,ij = 0 for any (i, j) ∉ E d , we can express the penalty terms for the dth mode as The graph in Figure 2 makes readily apparent that the convex objective in (3) separates over the connected components of the similarity graph for the mode-d slices. Consequently, one can solve for the optimal U component by component. Without loss of generality, we assume that the weights are such that all the similarity graphs are connected. Before leaving this preliminary description of the weights, however, we want to emphasize that in practice weights are set once in a data-adaptive manner and should be considered empirically chosen hyper-parameters rather than tuning parameters. Further discussion of the weights and practical recommendations for specifying them will be discussed in Section 6.
Having familiarized ourselves with the convex co-clustering of a 3-way array, we now present the natural extension of (3) for clustering the fibers of a general higher-order tensor X ∈ ℝ n 1 × ⋯ × n D along all its D modes. Let Δ d, ij = e i T − e j T where e i is the ith standard basis vector in ℝ n d . The objective function of our convex co-clustering for a general higher-order tensor is as follows.
The difference between the convex triclustering objective (3) and the general convex coclustering objective (4) is in the penalty terms. Previously in (3) we penalized the difference between pairs slices whereas in (4) we penalize the differences between pairs of mode-d subarrays.
Note that the function F γ (U) defined in (4) has a unique global minimizer. This follows immediately from the fact that F γ (U) is strongly convex. The unique global minimizer of F γ (U) is our proposed CoCo estimator, which is denoted by U for the remainder of the paper.
At times it will be more convenient to work with vectors rather than tensors. By applying the identity in (1), we can rewrite the objective function in (4) in terms of the vectorizations of U and X as follows where A d,ij is the n −d -by-n matrix A d, ij = I n D ⊗ ⋯ ⊗ I n d + 1 ⊗ Δ d, ij ⊗ I n d − 1 ⊗ ⋯ ⊗ I n 1 where I n d is the n d -by-n d identity matrix. We will refer to the unique global minimizer of (5), û = argmin u F γ (u), as the vectorized version of our CoCo estimator.  Sharpnack et al., 2012;Zhu et al., 2014;Chi and Lange, 2015;Radchenko and Mukherjee, 2017). Most prior work on convex clustering employ an element-wise ℓ 1 -norm penalty on pairwise differences, as in the original fused lasso, however, ℓ 2 -norm and ℓ ∞ -norm have also been considered (Hocking et al., 2011;Chi and Lange, 2015). In this paper, we restrict ourselves to the ℓ 2 -norm for two reasons. First, the ℓ 2 -norm is rotationally invariant. In general, we are reluctant to adopt a procedure whose co-clustering output may non-trivially change when the coordinate representation of the data along one of its modes is trivially changed. Second, the ℓ 2 -norm promotes the group-wise shrinkage of pairwise differences of subarrays along each mode leading to more straightforward partitioning along each mode. Pairwise differences are either exactly zero or not. When the tensor is a matrix and the rows and columns are being simultaneously clustered, we recover the objective function in the convex biclustering problem (Chi et al., 2017). In general, the fusion penalties R d (U) shrink solutions to vector valued functions that are piece-wise constant over the mode-d similarity graph defined by the weights w d,ij . Viewed this way, we can see our approach as simultaneously performing the network lasso (Hallac et al., 2015) on D similarity graphs.

Remark 2
The CoCo estimator is invariant to permutations in the data tensor X in the following sense. Suppose U and U ′ are the CoCo estimators when the data tensors are respectively X and X′ = X × 1 Π 1 × 2 ⋯ × D Π D where Π 1 ∈ 0, 1 n 1 × n 1 , …, Π D ∈ 0, 1 n D × n D are permutation matrices, namely Π d T Π d = I. In words, X′ can be obtained from X by permuting the subarrays of X along the dth mode according to Π d for d = 1, … , D, and X can be recovered from X′ by permuting along the dth mode according to Π d T for d = 1, … , D. Since ‖U × 1 Π 1 × 2 ⋯ × D Π D ‖ F = ‖U‖ F , it follows that Permutation invariance is important because it means that the CoCo estimator is essentially unaltered by any reshuffling along the modes of the data tensor.
Remark 3 Given the co-clustering structure assumed in (2), one may wonder how much is added by explicitly seeking a co-clustering over clustering along each mode independently. In other words, why not solve D independent convex clustering problems with R(U) = R d (U)? To provide some intuition on why co-clustering should be preferred over independently clustering each mode, consider the following problem. Imagine trying to cluster row vectors x i ∈ ℝ 10, 000 for i = 1, … , 100 drawn from a two-component mixture of Gaussians, namely x i iid 1 2 N μ, σ 2 I + 1 2 N ν, σ 2 I . This is a challenging clustering problem due to the disproportionately small number of observations compared to the number of features. If, however, we were told that μ j = μ 1 and ν j = ν 1 for j = 1, … , 5,000 and μ j = μ 2 and ν j = ν 2 for i = 5,001, … , 10,000, in other words that the features were clustered into two groups, our fortunes have reversed and we now have an abundance of observations compared to the number of effective features. Even if we lack a clear-cut clustering structure in the features, this example suggests that leveraging similarity structure along the columns can expedite identifying similarity structure along the rows, and vice versa. Indeed, if there is an underlying checkerbox mean tensor we may expect that simultaneously clustering along each mode should make the task of clustering along any one given mode easier. Our prediction error result presented in Section 4.2 in fact supports this suspicion (See Remark 10).

Properties
We first discuss how the CoCo estimator U behaves as a function of the data tensor X, the tuning parameter γ, and the weights w d,ij . We will then present its statistical properties under mild conditions on the data generating process. We highlight that these properties hold regardless of the algorithm used to minimize (4), as they are intrinsic to its convex formulation. All proofs are given in Appendix B and Appendix C.

Stability Properties
The CoCo estimator varies smoothly with respect to X, γ, and {w d,ij }. Let W d = {w d,ij } denote the weights matrix for mode d.
As noted earlier, in practice we will typically fix the weights w d,ij and compute the CoCo estimator over a grid of the penalization parameters γ in order to select a final CoCo estimator from among the computed candidate estimators of varying levels of smoothness. Since (4) does not admit a closed form minimizer, we resort to iterative algorithms for computing the CoCo estimator. Continuity of U in γ can be leveraged to expedite computation through warm starts, namely using the solution U γ as the initial guess for iteratively computing U γ′ where γ′ is slightly larger or smaller than γ. Due to the continuity of U in γ, small changes in γ will result in small changes in U. Empirically the use of warm starts can lead to a non-trivial reduction in computation time . From the continuity in γ, we also see that convex co-clustering performs continuous co-clustering just as the lasso (Tibshirani, 1996) performs continuous variable selection.
The penalization parameter γ tunes the complexity of the CoCo estimator. Clearly when γ = 0, the CoCo estimator coincides with the data tensor, namely U = X. the solution to which is just the global mean x, whose entries are all identically the average value of x over all its entries. The next result formalizes our intuition that as γ increases, the CoCo estimator will eventually coincide with x.
Proposition 6 Suppose Assumption 4.1 holds for d = 1, … ,D, then F γ (U) is minimized by the grand mean X for γ sufficiently large.
Thus, as γ increases from 0, the CoCo estimator U traces a continuous solution path that starts from n co-clusters, consisting of u i 1 ⋯i D = x i 1 ⋯i D , to a single co-cluster, where For a fixed γ, we can derive an explicit bound on sensitivity of the CoCo estimator to perturbations in the data.

Proposition 7
The minimizer U of (4) is a nonexpansive or 1-Lipschitz function of the data tensor X, namely Nonexpansivity of U in X provides an attractive stability result. Since U varies smoothly with the data, small perturbations in the data are guaranteed to not lead to large variability of U, or consequently large variability in the cluster assignments. In a special case of our method, Chi et al. (2017) showed empirically that the co-clustering assignments made by the 2-way version of the CoCo estimator was noticeably less sensitive to perturbations in the data than those made by several existing biclustering algorithms.

Statistical Properties
We next provide a finite sample bound for the prediction error of the CoCo estimator. For simplicity, we consider the case where we take uniform weights within a mode in (5), namely w d,ij = w d,i′j′ = 1/n d for all i, j, i′, j′ ∈ {1, … , n d }. Such uniform weight assumption has also been imposed in the analysis of the vector-version of convex clustering (Tan and Witten, 2015).
In order to derive the estimation error of û, we first define an important definition for the noise and introduce two regularity conditions. (2015)) We say a random vector y ∈ ℝ n is M-concentrated if there are constants C 1 , C 2 > 0 such that for any convex, 1-Lipschitz function ϕ: ℝ n ℝ and any t > 0,

Definition 8 (Vu and Wang
The M-concentrated random variable is more general than the Gaussian or sub-Gaussian random variables, and it allows dependence in its coordinates. Vu and Wang (2015) provided a few examples of M-concentrated random variables. For instance, if the coordinates of y are iid standard Gaussian, then y is 1-concentrated. If the coordinates of y are independent and M-bounded, then y is M-concentrated. If the coordinates of y come from a random walk with certain mixing properties, then y is M-concentrated for some M.

Assumption 4.2 (Model)
We assume the true cluster center C* ∈ ℝ k 1 × ⋯ × k D has a checkerbox structure such that the mode-d subarrays have k d different values (number of clusters along the dth mode), and each entry of C* is bounded above by a constant C 0 > 0.
Define U* ∈ ℝ n 1 × ⋯ × n D as the true parameter expanded based on C*, namely where M d ∈ 0, 1 n d × k d are binary mode-d cluster membership matrices such that M d 1 = 1.
Denote u* = vec U* ∈ ℝ n with n = ∏ d = 1 D n d . We assume the samples belonging to the (r 1 , … , r D )-th cluster satisfy The checkerbox means model in Assumption 4.2 provides the underlying cluster structure of the tensor data. As a special case, Assumption 4.2 with D = 2 reduces to the model assumption underlying convex biclustering (Chi et al., 2017). In contrast to the independent sub-Gaussian condition assumed in vector-version convex clustering (Tan and Witten, 2015), our error condition is much weaker since we allow for non-sub-Gaussian distributions as well as allow for dependence among its coordinates.
Theorem 9 Suppose that Assumption 4.2 and Assumption 4.3 hold. The estimation error of û in (5) with uniform weights satisfies, with a high probability, where C = 12c 0 C 0 2 is a positive constant, and k d is the true number of clusters in the dth mode.
Theorem 9 provides a finite sample error bound for the proposed CoCo tensor estimator. Our theoretical bound allows the number of clusters in each mode to diverge, which reflects a typical large-scale clustering scenario in big tensor data. A notable consequence of Theorem 9 is that, when D ≥ 3, namely a higher-order tensor with at least 3 modes, the CoCo estimator can achieve estimation consistency along all the D modes even when we only have one tensor sample. Here the sample size refers to the number of available tensor samples. In our tensor clustering problem, we only have access to one tensor sample.
This property is uniquely enjoyed by co-clustering of tensor data with D ≥ 3, and has not been previously established in the existing literature on vector clustering or biclustering. To see this, when n d are of the same order as n 0 , and k d are of the same order as k 0 , a sufficient condition for the consistency is that n 0 → ∞ and k 0 = o n 0 (D − 2)/(D − 1) up to a log term. When D = 3, the CoCo estimator is consistent so long as the number of clusters k 0 in each mode diverges slightly slower than n 0 . Remarkably, as we have more modes in the tensor data, this constraint on the rate of divergence of k 0 gets weaker. In short, we reap a unique and surprisingly welcome "blessing of dimensionality" phenomenon in the tensor co-clustering problem.
Remark 10 Next we discuss the connections of our bound (7) with prior results in the literature. An intermediate step in the proof of Theorem 9 indicates that the estimation error in the dth mode is on the order of 1/n d + log(n)/ nn d + log(n) n d ∏ j ≠ d k j /n −d . In the clustering along the rows of a data matrix, our rate matches with that established for vector-version convex clustering (Tan and Witten, 2015), up to a log term log(n). Such a log term is due to that fact that Tan and Witten (2015) considers the error to be iid sub-Gaussian while we consider a general M-concentrated error. In practice, the iid assumption on the noise ϵ = vec(E) could be restrictive. Consequently, our theoretical analysis is built upon a new concentration inequality of quadratic forms recently developed in Vu and Wang (2015). In addition, our rate reveals an interesting theoretical property of the convex biclustering method proposed by Chi et al. (2017). When D = 2, our rate indicates that the estimation error along the row and column of the data matrix is log n 1 n 2 n 1 k 2 /n 2 and log n 1 n 2 n 2 k 1 /n 1 , respectively. Clearly, both errors can not converge to zero simultaneously. This indicates a disadvantage of matricizing a data tensor for co-clustering.

Estimation Algorithm
We next discuss a simple first order method for computing the solution to the convex co-clustering problem. The proposed algorithm generalizes the variable splitting approach introduced for convex clustering problem described in Chi and Lange (2015) to the CoCo problem. The key observation is that the Lagrangian dual of an equivalent formulation of the convex co-clustering problem is a constrained least squares problem that can be iteratively solved using the classic projected gradient algorithm.

A Lagrangian Dual of the CoCo Problem
Recall that we seek to minimize the objective function in (5) Note that we have enumerated the edge indices in E d to simplify the notation for the following derivation.
We perform variable splitting and introduce the dummy variables v d, T T denote the vector obtained by stacking the vectors v d on top of each other. We now solve the equivalent equality constrained minimization where A d = I n D ⊗ ⋯ ⊗ I n d + 1 ⊗ Φ d ⊗ I n d − 1 ⊗ ⋯ ⊗ I n 1 and Φ d is the oriented edge-vertex incidence matrix for the dth mode graph, namely We introduce dual variables λ d corresponding to the equality constraint v d = A d u. Let Λ d denote the n −d × |E d | matrix whose lth column is λ d,l . Further denote the vectorization of Λ d by λ d = vec(Λ d ) and λ = λ 1 T λ 2 T ⋯ λ D T T . The Lagrangian dual objective is given by Maximizing the dual objective G(λ) is equivalent to solving the following constrained least squares problem: where C = λ: λ d, l ∈ C d, l , l ∈ E d , d = 1, …, D . We can recover the primal solution via the relationship: where λ is a solution to the dual problem (8). The dual problem (8) has at least one solution by the Weierstrass extreme value theorem, but the solution may not be unique since A T has a non-trivial kernel. Nonetheless, our CoCo estimator û is still unique since A T λ 1 = A T λ 2 for any solutions λ 1 , λ 2 to the problem (8).
We numerically solve the constrained least squares problem in (8) with the projected gradient algorithm, which alternates between taking a gradient step and projecting onto the set C. Algorithm 1 provides pseudocode of the projected gradient algorithm, which has several good features. The projected gradient algorithm is guaranteed to converge to a global minimizer of (8  , 2014, 2015). Additional details on the derivation of the algorithmic updates, convergence guarantees, computational and storage costs, as well as stopping rules can be found in Appendix E.

Specifying Non-Uniform Weights
In Section 4.2, we assumed uniform weights w d,ij in the penalty terms R d (U) to establish a prediction error bound, which revealed a surprising and beneficial "blessing of dimensionality" phenomenon. Although this simplifying assumption gives clarity and insight into how the co-clustering problem gets easier as the number of modes increases, in practice choosing non-uniform weights can substantially improve the quality of the clustering results. In the context of convex clustering, Chen et al. (2015) and Chi and Lange (2015) provided empirical evidence that convex clustering with uniform weights struggled to produce exact sparsity in the pairwise differences of smooth estimates when there was not a strong separation between groups. Indeed, similar phenomena were observed in earlier work on the related clustered lasso (She, 2010). Several related works (She, 2010;Hocking et al., 2011;Chen et al., 2015;Chi and Lange, 2015) recommend a weight assignment strategy described below. In addition, the use of sparse weights can also lead to non-trivial Step Step end for end for until convergence To illustrate the practical value of non-uniform weights, we compare CoCo's ability to recover co-clusters, using both uniform and non-uniform weights, as the size of a 3-way tensor increases when there are two clusters per mode with balanced cluster sizes along each mode. We assess the quality of the recovered clustering performance using the Adjusted Rand Index (ARI). The ARI (Hubert and Arabie, 1985) varies between −1 and 1, where 1 indicates a perfect match between two clustering assignments whereas a value close to zero indicates the two clustering assignments match about as might be expected if they were both randomly generated. Negative values indicate that there is less agreement between clusterings than expected from random partitions. Figure 3 shows a comparison between using non-uniform weights that are described in Section 6.2 and uniform weights. Each plotted point in Figure 3 is the average ARI over 100 replicates. For CoCo using non-uniform weights, the smoothing parameter γ is chosen with the data-driven extended BIC method that is detailed in Section 7.1. In contrast, for CoCo using uniform weights, γ is chosen as the value that produces the estimator that minimizes the true but unknown MSE.
We see that while using uniform weights in CoCo leads to recovering co-clusters exactly once a sufficient number of samples have been acquired, using non-uniform weights enables CoCo to recover the co-clusters exactly with notably fewer samples. The results of this experiment are especially remarkable because CoCo using non-uniform weights and a dataadaptive choice of γ outperformed CoCo using uniform weights and an ideally chosen oracle value of γ.
As in the case of convex clustering, using non-uniform weights can lead to significantly better performance over using uniform weights in practice. We give some explanation for why this is expected in Section 6.3 but leave it to future work to develop theory proving this performance improvement. Nonetheless based on this observation, we employ non-uniform weights in CoCo for the empirical studies presented later in the paper.

Basic Procedure for Specifying Weights
We first describe our basic two step procedure for constructing weights before elaborating on the final refinements used in our numerical experiments.
Step 1: We first calculate pre-weights w d, ij between the ith and jth mode-d subarrays as The first factor on the right hand side of equation (9), ι i, j k , is an indicator function that equals 1 if the jth slice is among the ith slice's k-nearest neighbors (or vice versa) and 0 othewise. The purpose of this term is to control the sparsity of the weights. The corresponding tuning parameter k influences the connectivity of the mode-d similarity graph. One can explore different levels of granularity in the clustering by varying k (Chen et al., 2015). As a default, one can use the smallest k such that the similarity graph is still connected. Note it is not necessary to calculate the exact k-nearest neighbors, which scales quadratically in the number of fibers in the mode. A fast approximation to the k-nearest neighbors is sufficient for the sake of inducing sparsity into the weights. Chi and Lange (2015) provided two reasons for using k-nearest neighbor weights. First, we wish to prioritize fusions between pairs of subarrays that are most similar; the subarrays that are most dissimilar should be the last pair of subarrays to fuse as the smoothing parameter γ increases. Second, we wish to use a sparse similarity graph as the computational and storage complexity of the estimation algorithm is proportional to the number of non-zero edges in the similarity graphs (Appendix E). Using k-nearest-neighbors weights accomplishes both goals.
The second factor on the right hand side of equation (9) is the Gaussian kernel, which takes on larger values for pairs of mode-d subarrays that are more similar to each other. Chi and Steinerberger (2019) give a detailed theoretical justification for using weights like the Gaussian kernel weights in the context of convex clustering. For space considerations, we refer readers interested in these technical details to their work and give a brief intuitive rationale for the employing the Gaussian kernel here. Intuitively, the weights should be inversely proportional to the distance between the ith and jth mode-d subarrays (Chen et al., 2015;Chi et al., 2017). The inverse of the nonnegative parameter τ d is a measure of scale. In practice, we can set it to be the median Euclidean distance between the ith and jth mode-d subarrays that are k-nearest neighbors of each other. A value of τ d = 0 corresponds to uniform weights. Note that with minor modification, we can make the inverse scale parameter to be pair dependent as described in Zelnik-Manor and Perona (2005).
Step 2: To obtain the mode-d weights w d,ij , we normalize the mode-d pre-weights w d, ij to sum to n d /n. The normalization step puts the penalty terms R d (U) on the same scale and ensures that clustering along any given single mode will not dominate the entire co-clustering as γ increases.

Improving Weights via the Tucker Decomposition
In our preliminary experiments, we found that substituting a low-rank approximation of X, namely a Tucker decomposition X, in place of X in (9) led to a marked improvement in coclustering performance. To understand the boost in performance suppose that X = U* + E with U* having a checkerbox structure and the entries of E are iid N(0, σ 2 ) for simplicity.
Further suppose that the ith and jth mode-d subarrays of U ⋆ belong to the same partition and is distributed as a χ 2 random variable with n d degrees of freedom.
If we were able to perfectly denoise the tensor X so that σ = 0, then the pre-weight w d, ij would be set to its maximal value of 1, the ideal value for w d, ij since we have assumed the ith and jth mode-d subarrays belong to the same partition. Thus, if we can reduce σ 2 , namely denoise the observed tensor X, we can approach the ideal value of pre-weights. Note that we are more focused with approaching the ideal pre-weight values for pairs of subarrays that belong to the same partition and not concerned with pairs of subarrays in different partitions as the Gaussian kernel weights decay very rapidly. The Tucker decomposition is effective at reducing σ 2 when U* has a checkerbox pattern as the checkerbox pattern is a low-rank tensor that can be effectively approximated with the Tucker decomposition.
Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. In our simulation studies described in Section 8, we use two different methods for choosing the rank as a robustness check to ensure our CoCo estimator's performance does not crucially depend on the rank selection method. Details on these two methods can be found in Appendix F. While we found the Tucker decomposition to work well in practice, we suspect that other methods of denoising the tensor may work just as well or could possibly be more effective. We leave it to future work to explore alternatives to the Tucker decomposition.

Weights and Folded-Concave Penalties
We conclude our discussion on weights by highlighting how they provide a connection between convex clustering and other penalized regression-based clustering methods that use folded-concave penalties Xiang et al., 2013;Zhu et al., 2013;Marchetti and Zhou, 2014;Wu et al., 2016a). Suppose we seek to minimize the objective The inequality (11) indicates that the first order Taylor expansion of a differentiable concave function φ d provides a tight global upper bound at the expansion point z. Thus, we can construct a function that is a tight upper bound of the function f γ (u) where the constant c does not depend on u and w d,ij are weights that depend on ũ, namely Note that if we take ũ to be the vectorization of the Tucker approximation of the data, vec(X), and φ d (z) to be the following variation on the error function then the function given in (10) coincides with the CoCo objective using the prescribed Tucker derived Gaussian kernel weights.
The function g γ (u | ũ) is said to majorize the function f γ (u) at the point ũ (Lange et al., 2000) and minimizing it corresponds to performing one-step of the local linearapproximation algorithm (Zou and Li, 2008;Schifano et al., 2010) which is a special case of the majorization-minimization (MM) algorithm (Lange et al., 2000). The corresponding MM algorithm would consist of repeating the following two steps: (i) using a previous CoCo estimate U to compute weights w d,ij according to (13), and (ii) computing a new CoCo estimate using the new weights. In practice, we have found one-step to be adequate, however. Indeed, Zou and Li (2008) showed that the solution to the one-step algorithm was often sufficient in terms of its statistical estimation accuracy.

Other Practical Issues
In this section, we address other considerations for using the method in practice, namely how to choose the tuning parameter γ and how to recover the partitions along each mode from the CoCo estimator U.

Choosing γ
The first major practical consideration is how to choose γ to produce a final co-clustering result. Since co-clustering is an exploratory method, it may be suitable for a user to manually inspect a sequence of CoCo estimators U γ for a range of γ and use domain knowledge tied to a specific application to select γ to recover a co-clustering assignment of a desired complexity. Since this approach is time consuming and requires expert knowledge, an automated, data-driven procedure for selecting γ is desirable. Cross-validation (Stone, 1974;Geisser, 1975) and stability selection (Meinshausen and Bühlmann, 2010) are popular techniques for tuning parameter selection, but since both methods are based on resampling, they are unattractive in the tensor setting due to the computational burden. We turn to the extended Bayesian Information Criterion (eBIC) proposed by Chen (2008, 2012), as it does not rely on resampling and thus is not as computationally costly as cross-validation or stability selection.
where RSS γ is the residual sum of squares ‖X − U γ ‖ F 2 and df γ is the degrees of freedom for a particular value of γ. We use the number of co-clusters in the CoCo estimator U γ as an estimate of df γ , which is consistent with the spirit of degrees of freedom since each co-cluster mean is an estimated parameter. This criterion balances between model fitting and model complexity, and a similar version has been commonly employed in tuning parameter selection of tensor data analysis (Zhou et al., 2013;Sun et al., 2017).
The eBIC is calculated on a grid of values S = γ 1 , γ 2 , …γ s , and we select the optimal γ, denoted γ*, which corresponds to the smallest value of the eBIC over S, namely γ ⋆ = arg min γ ∈ S eBIC(γ) .

Recovering the Partitions along Each Mode
The second major practical consideration is how to extract the partitions from the CoCo estimator U. Recall that the ith and jth mode-d subtensors belong to the same partition if v d, ij = U × d Δ ij = 0. Conversely, the ith and jth mode-d subtensors do not belong to the same partition if v d,ij ≠ 0. Thus, a mode-d partition consists of the maximal set of mode-d subarrays such that for any pair i and j in this collection v d,ij = 0. We can automatically identify these maximal sets by extending a simple procedure employed by Chi and Lange (2015) for extracting clusters in the convex clustering problem. Identifying partitions along the dth mode is equivalent to finding connected components of a graph, where each node corresponds to a subarray along the dth mode, and there is an edge between nodes i and j if and only if v d,ij = 0.
We would like to read off which centroids have fused as the amount of regularization increases, namely determine partition assignments as a function of γ. Such assignments can be performed in O n d operations, using the differences variable V d . We simply apply breadth-first search to identify the connected components of the following graph induced by the V d . The graph identifies a node with every data point and places an edge between the lth pair of points if and only if v l = 0. Each connected component corresponds to a partition. Note that the graph constructed to determine partitions is not the same as the graph described in Section 3 with illustrative examples in Figure 2.
We emphasize that the recovered partition along each mode does not depend on the ordering of the input data X, since it is based off of the pairwise differences along each mode, namely V d for d = 1, … ,D. Finally, we note that due to finite precision limitations, the difference variables v d,ij will likely not be exactly 0. In Appendix E.4, we detail a simple and principled procedure for ensuring sparsity in these difference variables.

Simulation Studies
To investigate the performance of the CoCo estimator in identifying co-clusters in tensor data, we first explore some simulated examples. We compare our CoCo estimator to a kmeans based approach that is representative of various tensor generalizations of the spectral clustering method common in the tensor clustering literature (Kutty et al., 2011;Liu et al., 2013b;Zhang et al., 2013;Wu et al., 2016b). We refer to this method as CPD+k-means. The CPD+k-means method (Papalexakis et al., 2013; Sun and Li, 2019) first performs a rank-R CP decomposition on the D-way tensor X to reduce the dimensionality of the problem, and then independently applies k-means clustering to the rows of each of the D factor matrix from the resulting CP decomposition. The k-means algorithm has also been used to cluster the factor matrices resulting from a Tucker decomposition (Acar et al., 2006;Sun et al., 2006;Kolda and Sun, 2008;Sun et al., 2009;Kutty et al., 2011;Liu et al., 2013b;Zhang et al., 2013;Cao et al., 2015;Oh et al., 2017). We also considered this Tucker+k-means method in initial experiments, but its co-clustering performance was inferior to that of CPD+k-means so we only report co-clustering performance results for CPD+k-means in the comparison experiments that follow. Note, however, that we still use the Tucker decomposition to compute CoCo weights w d,ij as described Section 6. Both CoCo and CPD+kmeans account for the multiway structure of the data. To assess the importance of accounting for this structure, we also include comparisons with the CoTeC method (Jegelka et al., 2009), which applied k-means clustering along each mode and does not account for the multiway structure of the data.
All methods being compared have tuning parameters that need to be set. For the rank of the CP decomposition needed in CPD+k-means, we consider R ∈ {2, 3, 4, 5} and use the tuning procedure in Sun et al. (2017) to automatically select the rank. A CP decomposition is then performed using the chosen rank, and those factor matrices are the input into the k-means algorithm. A well known drawback of k-means is that the number of clusters k needs to be specified a priori. Several methods for selecting k have been proposed in the literature, and we use the "gap statistic" developed by Tibshirani et al. (2001) to select an optimal k* from the specified possible values. Since CoCo estimates an entire solution path of mode-clustering results, ranging from n d clusters to a single cluster along mode d, we consider a rather large set of possible k values to make the methods more comparable.
Appendix G gives a more detailed description of the CPD+k-means procedure and the selection of its tuning parameters. CoTeC, which applies k-means clustering along each mode independently, also requires specifying the number of cluster along each mode. As in CPD+k-means, we also select this parameter along each mode using the "gap statistic." As described in Section 6, we employ a Tucker approximation to the data tensor in constructing weights w d,ij . In computing the Tucker decomposition we used one of two methods for selecting the rank. In the plots within this section, TD1 denotes the results where the Tucker rank was chosen using the SCORE algorithm (Yokota et al., 2017), while TD2 denotes results where the rank was chosen using a heuristic. Detailed discussion on these two methods are in Appendix F.
The results presented in this section report the average CoCo estimator performance quantified by the ARI across 200 simulated replicates. All simulations were performed in MATLAB using the Tensor Toolbox (Bader et al., 2015). All the following plots, except the heatmaps in Figure 13, were made using the open source R package ggplot2 (Wickham, 2009).

Cubical Tensors, Checkerbox Pattern
For the first and main simulation setting, we study clustering data in a cubical tensor generated by a basic checkerbox mean model according to Assumption 4.2. Each entry in the observed data tensor is generated according to the underlying model (2) with independent errors ϵ i 1 i 2 i 3 N 0, σ r 1 r 2 r 3 2 . Unless specified otherwise, there are two true clusters along each mode for a total of eight underlying co-clusters.

Balanced Cluster Sizes and Homoskedastic Noise-To get an initial feel
for how the different co-clustering methods perform at recovering the true underlying checkerbox structure, we first consider a situation where the clusters corresponding to the two classes along each mode are all equally-sized, or balanced, and share the same error variance, namely σ r 1 r 2 r 3 = σ for all r 1 , r 2 , and r 3 . The average co-clustering performance for this setting in a tensor with dimensions n 1 = n 2 = n 3 = 60 are given in Figure 4 for different noise levels. Figure 4 shows that all three methods perform well when the noise level is low (σ = 1). As the noise level increases, however, CPD+k-means experiences an immediate and noticeable drop off in performance. CoTeC's performance decays even more rapidly highlighting the importance of accounting for multiway structure. The CoCo estimator, on the other hand, is able to maintain near-perfect performance until the noise level becomes rather high (σ = 8). Figure 5 shows how the run times of CoCo and CPD+k-means vary as the size of a cubic tensor, n = n 1 n 2 n 3 with n 1 = n 2 = n 3 takes on the values 20 3 , 30 3 , 60 3 , and 100 3 . These run times include all computations needed to fit and select a final model. For CoCo, a sequence of models were fit over a grid of γ parameters, and a final γ parameter was chosen using the eBIC. For CPD+k-means, a sequence of models were fit over a grid of possible (k 1 , k 2 , k 3 ) parameters corresponding to the 3 factor matrices, and a final triple of (k 1 , k 2 , k 3 ) parameters were chosen using the "gap statistic." Timing comparisons were performed on a 3.2 GHz quad-core Intel Core i5 processor and 8 GB of RAM. The run time for CoCo scales linearly in the size of the data tensor as expected, namely proportionately with n 1 3 .
Nonetheless, as also might be expected, the clustering performance enjoyed by CoCo does not come for free, and the simpler but less reliable CPD+k-means algorithm enjoys a better scaling as the tensor size grows. Timing results were similar for the following experiments and are omitted for space considerations.

Imbalanced Cluster
Sizes-When comparing clustering methods, one factor of interest is the extent to which the relative sizes of the clusters impact clustering performance.
To investigate this, we again use a cubical tensor of size n 1 = n 2 = n 3 = 60 but introduce different levels of cluster size imbalance along each mode, which we quantify via the ratio of the number of samples in cluster 2 of mode d and the total number of samples along mode d, for d = 1, 2, 3. Figure 6a shows that when the noise level is low, CPD+k-means is unaffected by the imbalance until the size of cluster 2 is less than 30% of the mode's length. At this point, the performance of CPD+k-means drops off significantly and it performs as well as a random clustering assignment when the sizes are highly skewed (n d2 /n d = 0.1). The CoCo estimator is more or less invariant to the imbalance, and its performance is almost perfect across all levels of cluster size imbalance. Figure 6b shows that the CoCo estimator exhibits a slight deterioration in performance only when the cluster size ratio is 0.1 in the high noise case. In both low and high noise scenarios, CoTeC performs poorly.

Heteroskedastic Noise-Another
factor of interest is how the clustering methods perform when there is heteroskedasticity in the variability of the two classes. Figure 7 displays the co-clustering performance for different degrees of heteroskedasticity, as measured by the standard deviation for class 2 relative to class 1's standard deviation, σ 2 /σ 1 . In the low noise setting, the CoCo estimator is immune to the heteroskedasticity until the noise levels differ by a factor of 4. CPD+k-means in contrast is very sensitive to a deviation from homoskedasticty, experiencing a decline even when the noise ratio increases from 1 to only 1.5. The CoCo estimator fares worse in the high noise setting and also has a drop in performance with a small deviation from homoskedasticty. Once class 2's standard deviation is more than double the standard deviation for class 1, all three methods are essentially the same as random clustering. This result is not terribly surprising since, in the high noise setting, this would result in one class having a very high standard deviation of σ 2 = 12. In both low and high noise scenarios, CoTeC performs poorly.

Different Clustering
Structures-So far, we have considered only a simple situation where there are exactly two true clusters along each mode, for a total of eight triclusters. Another factor of practical importance is how the clustering methods perform when there are more than two clusters per mode, and also when the number of clusters along each mode differs. We investigate both of these settings in this section. As before, the tensor is a perfect cube with n 1 = n 2 = n 3 = 60 observations along each mode and an underlying checkerbox pattern. To gauge the performance, we again focus the attention on how the methods perform in the presence of both low and high noise.
The first situation studied is one in which there are three true clusters along each mode, resulting in a total of 27 triclusters. The left hand side of the graphs in Figure 8 show the results from this simulation setting. The graphs show that CoCo estimator consistently outperforms CPD+k-means and CoTeC in this setting across both noise levels. The CoCo estimator is able to recover the true co-clusters almost perfectly, while CPD+k-means struggles to handle the increased number of clusters per mode.
We also investigated the clustering performance when the number of clusters per mode varies. In this setting, there are two, three, and four clusters along modes one, two, and three, respectively. From the right hand side of the graphs in Figure 8, we can see that the results are similar to the situation with three clusters per mode. CPD+k-means again performs very poorly across both noise levels, while convex co-clustering is again able to essentially recover the true co-clustering structure. Compared to the setting with three clusters per mode, CPD+k-means performs slightly worse in the face of a more complex clustering structure, while convex co-clustering is able to handle it in stride. These results bode well for convex co-clustering as the basic clustering structure of only two clusters per mode is unlikely to be observed in practice.

Rectangular Tensors
Up to this point, to get an initial feel for CoCo's performance, we restricted our attention to cubical tensors with the same number of observations per mode so as to avoid changing too many factors at once. It is unlikely that the data tensor at hand will be a perfect cube, however, so it is important to understand the clustering performance when the methods are applied to rectangular tensors. Now we turn to cluster a rectangular tensor with one short mode and two longer modes. Two additional simulations involving rectangular tensors can be found in Appendix H. Figure 9 shows that CoCo performs very well and better than CPD+k-means and CoTeC at the lower noise level (σ = 3) but has a sharp decrease in ARI at the higher noise level (σ = 4). The decline is more pronounced for the longer modes (Figure 9b and Figure 9a) as the short mode ( Figure 9a) is still able to maintain perfect performance despite the increase in noise. This is not surprising, since the shorter mode has effectively more samples. Moreover, we see the "blessing of dimensionality" at work when the number of samples along the short mode are doubled (n 1 = 20, n 2 = n 3 = 50), the performance along the two longer modes improves drastically in the high noise setting.
We finally note that, along the shorter mode, the use of the heuristic in determining the rank of the Tucker decomposition for calculating the weights performs better than the SCORE algorithm method along modes 1 and 2, though ultimately the co-clustering performance is comparable. This may indicate that the SCORE algorithm struggles to correctly identify the optimal Tucker rank for short modes in the presence of relatively higher noise, while the heuristic is more immune to the noise level as it is based simply on the dimensions of the tensor.

CANDECOMP/PARAFAC Model
In Section 8.1, we saw that the CoCo estimator performs well and typically better than CPD+k-means when clustering tensors whose co-clusters have an underlying checkerbox pattern. To evaluate the performance of our CoCo estimator under model misspecification, we consider the generative model as the following CP decomposition model. We first construct the factor matrix A ∈ ℝ 80 × 2 and construct the following rank-2 CP means tensor where • denotes the outer product. We then added varying levels of Gaussian noise to the U* to generate the observed data tensor. We consider two different types of factor matrices.
As shown in Figure 10, one shape consists of two half-moon clusters (Hocking et al., 2011;Chi and Lange, 2015;Tan and Witten, 2015) while the other shape contains a bullseye, similar to the two-circles shape studied by Ng et al. (2002) and Tan and Witten (2015). In either case, the triangles in Figure 10 correspond to the first 40 rows of A, whereas the circles correspond to the second 40 rows of A. Note that this data generating mechanism should favor the CPD+k-means method. Figure 11 shows the simulation results for using the CP model with these two non-convex shapes generating the data. The discrepancy in performance between the CoCo estimator and the other two methods is quite large. The CoCo estimator almost perfectly identifies the true co-clusters. In contrast, both CPD+k-means and CoTeC perform very poorly, even when the noise variance is small. The poor performance of CPD+k-means and CoTeC are not completely surprising as other have noted the difficulty that k-means methods have in recovering non-convex clusters (Ng et al., 2002;Hocking et al., 2011;Tan and Witten, 2015). These results give us some assurances that the CoCo estimator is able to still perform well even under some model misspecification since the true co-clusters do not have a checkerbox pattern.

Comparison with Convex Biclustering
It is natural to ask how much additional gain there is in using CoCo over convex biclustering (Chi et al., 2017) on the matricizations of a data tensor. To answer this question, we compare CoCo to the following strategy for applying convex biclustering to estimate coclusters. We explain the strategy for a 3-way tensor; the generalization to D-way tensors is straightforward. We first matricize the tensor X along mode-1 to obtain the matrix X (1) , apply convex biclustering on X (1) , and retain the mode-1 clustering results. Note that the mode-2 and mode-3 fibers have been mixed together through the matricization process. We then repeat the two-step procedure for mode-2 and mode-3. The final co-cluster estimates are obtained by taking the cross-products of the mode-1, mode-2, and mode-3 cluster assignments.
We consider two illustrative scenarios to understand the value of preserving the full multiway structure with CoCo: a balanced case and imbalanced case. In the balanced case, we have a 3-way data tensor x ∈ ℝ 60 × 60 × 60 with two clusters along each mode, where clusters are of equal size and homoskedastic iid Gaussian noise has been added to all elements of the tensor. This scenario is similar to the one shown in Figure 4. In the imbalanced case, we have a 3-way data tensor x ∈ ℝ 30 × 40 × 80 . There are two clusters along mode-1 of sizes 10 and 20, three clusters along mode-2 of sizes 8, 12, and 20, and four clusters along mode-3 of sizes 5, 10, 20, and 45. Homoskedastic iid Gaussian noise has been added to all elements of the tensor. Finally, we note that the empirical performance of convex biclustering, like that of CoCo's, depends on choosing good weights for the rows and columns of the input data matrix (Chi et al., 2017). To create a fair comparison, we construct convex biclustering weights based off of the same TD1 and TD2 denoising procedure used for CoCo, putting the preprocessing for both methods on equal footing. Figure 12a and Figure 12b show the co-clustering performance of CoCo and the convex biclustering method in the balanced and imbalanced cases respectively. We see that in the balanced case, CoCo's performance is marginally better than that of the convex biclustering method. On the other hand, we see that in the imbalanced case, CoCo's performance degrades more gracefully than that of the convex biclustering method as the noise level increases. The example illustrates that CoCo has better co-cluster recovery when there is more imbalance in the data tensor -the aspect ratios of the tensor dimensions are more skewed and the number of clusters and the cluster sizes are more heterogenous.
The key formulation difference between CoCo and the convex biclustering method that provides some insight into these two results is that CoCo imposes a finer level of smoothness that respects the multiway structure in the data tensor. Imposing such finer level of smoothness imparts greater robustness in the presence of increasing noise to recovering the smaller co-clusters in the imbalanced scenario. An added incentive for using CoCo and preserving the multiway structure in the data is that the gains in co-cluster recovery over the convex biclustering method do not come at a greater computational cost. Note that the computational complexity of convex biclustering is O(n), using sparse weights for the row and column similarity graphs. For a D-way tensor, the computational complexity then becomes O(Dn), which is the same as the computational complexity of CoCo applied directly on the D-way tensor.
To summarize, in comparison to the convex biclustering method, CoCo (i) does not come at additional computational costs, (ii) can recover underlying co-clustering structure in imbalanced scenarios which are more likely to be encountered in practice, and (iii) has the ability to consistently recover an underlying co-clustering structure according to Theorem 9, with even a single tensor sample, which is a typical case in real applications. Since this phenomenon does not exist in vector or matrix variate cluster analysis, the convex biclustering method lacks this theoretical guarantee.

Real Data Application
Having studied the performance of the CoCo estimator in a variety of simulated settings, we now turn to using the CoCo estimator on a real data set. The proprietary data set comes from a major online company and contains the click-through rates for advertisements displayed on the company's webpages from May 19, 2016 through June 15, 2016. The clic-kthrough rate is the number of times a user clicks on a specific advertisement divided by the number of times the advertisement was displayed. The data set contains information on 1000 users, 189 advertisements, 19 publishers, and 2 different devices, aggregated across time. Thus, the data forms a fourth-order tensor where each entry in the tensor corresponds to the click-through rate for the given combination of user, advertisement, publisher, and device. Here a publisher refers to a different webpage within the online company's website, such as the main home page versus a page devoted to either breaking news or sports scores. The two device types correspond to how the user accessed the page, using either a personal computer or a mobile device such as a cell phone or tablet computer. The goal in this real application is to simultaneously cluster users, advertisements, and publishers to improve user behavior targeting and advertising planning.
In the click-through rate tensor data, over 99% of the values are missing since one user likely has seen only a handful of the possible advertisements. If a specific advertisement is never seen by a user, it is considered as a missing value. Since the proposed CoCo estimator can only handle complete data, we first preprocess the data by imputing the missing values before any clustering can be done. To impute the missing entries, we use the CP-based tensor completion method Jain and Oh (2014) and tune its rank via the information criterion proposed by Sun et al. (2017). This tuning method chooses the optimal rank as R = 20 from the rank list {1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 18, 20, 22}. Finally, the imputed values are truncated to ensure all the values of the tensor are within 0 and 1 since click-through rates are proportions.
One mode of the fourth-order tensor has only two observations and those observations already have a natural grouping (device type). Therefore, for the sake of clustering we analyze the devices separately. We compare our method with CPD+k-means. Furthermore, the tuning parameter for convex co-clustering is automatically selected using the eBIC (Section 7.1) while the number of clusters in CPD+k-means is chosen via the gap statistic (Tibshirani et al., 2001). We do not include comparisons with CoTeC given its poor performance in the simulation experiments.
We first look at the clustering results from clustering the click-through rates for users accessing the advertisements through a personal computer (PC). Table 1 contains the number of clusters identified as well as the sizes of the clusters, while Figure 13a visualizes the advertisement-by-publisher biclusters for a randomly selected user. As to be expected, the advertisement-by-publisher slices display a checkerbox pattern, which turns into a checkerbox pattern when the slices are meshed together. The clustering results for the users are omitted in this paper to ensure user privacy. However, co-clustering the tensor does not result in the loss of information that would occur if the tensor was converted into a matrix by averaging across users or flattening along one of the modes. Table 1 and Figure 13a show that the CoCo estimator identifies four advertisement clusters, with one cluster being much bigger than the others. The advertisements in this large cluster have click-through rates that are close to the grand average in the data set. One of the small clusters has very low clickthrough rates, while the other two clusters tend to have much higher click-through rates than the rest of the advertisements. On the other hand, CPD+k-means clusters the advertisements into 57 groups, which is less-useful from a practical standpoint. Many of the clusters are similarly-sized and contain only a few advertisements, likely due to the inability of CPD+kmeans to handle imbalanced cluster sizes as was observed in the simulation experiments (Section 8.1.2). In terms of the publishers, the CoCo estimator identifies 3 clusters while CPD+k-means does not find any underlying grouping and simply identifies one big cluster, which again is not terribly useful (Table 1). We next provide some interpretations of the obtained clustering results of the publishers. One way online advertisers can reach more users is by entering agreements with other companies to route traffic to the advertiser's website. For example, Google and Apple have a revenue-sharing agreement in which Google pays Apple a percentage of the revenue generated by searches on iPhones (McGarry, 2016). Similarly, the online company being studied partners with several internet service providers (ISPs) to host the defaut home pages for the ISP's customers. It would make sense that these slightly different variants of the online company's main home page would have similar click-through rates, and the CoCo estimator in fact assigned these variants into the same cluster.
For users accessing the advertisements through a mobile device, such as a mobile phone or tablet computer, the CoCo estimator results for the advertisements are largely similar to the results for PCs (Table 1 and Figure 13b). There is one large cluster that contains click-through rates similar to the overall average, while the two other equally-sized clusters have relatively very low or very high click-through rates, respectively. The underlying clickthrough rates for the PC data have more variability than the mobile data, which is consistent with the identification of an additional cluster for the PC data. As before, CPD+k-means finds a large number of advertisement clusters, most of which are roughly the same size, again likely impacted by the imbalance in the cluster sizes. When compared to the personal computer device, one difference is that the cluster with the higher click-through rates for mobile devices is larger and has a higher average click-through rate than the similar clusters for the personal computer device. This finding is consistent with research by the Pew Research Center that found that click-through rates for mobile devices are higher than for advertisements viewed on a personal computer or laptop (Mitchell et al., 2012).
It is also enlightening to take a closer look at the underlying advertisements clustered across the two devices. All of the advertisements clustered in the high click-through rate cluster for the mobile devices are in the average click-through rate cluster for personal computers.
In taking a closer look at the ads in these clusters, there are several ads related to online shopping for personal goods, such as jeans, workout clothes, or neck ties. It makes sense to shop for these types of goods using a mobile device, such as while at work when it is not appropriate to do so on a work computer. Conversely, all of the advertisements in either of the two higher PC click-through rate clusters are in the large, average click-through rate cluster for the mobile devices. There are several financial-related ads in these two PC clusters, such as for mortgages or general investment advice. On the other hand, there are not many online shopping ads in those clusters, with the exception of more expensive technology-related goods that one may want to invest more time in researching before making a purchase.
In terms of the publisher clusters on mobile device, Table 1 shows that the CoCo estimator identifies two clusters of publishers while CPD+k-means identifies 13 small clusters. Contrary to the advertisement clusters, the publisher clusters across both devices are very similar. In fact, the only difference is that the smaller cluster for the mobile device, which contains seven publishers, is split into two clusters for personal computers. This can be seen in the click-through rate heatmaps given in Figure 13 in looking at the right part of each heatmap. The publishers in these smaller clusters have higher click-through rates on average than those in the larger cluster. Additionally, five of the seven (71%) publishers in the high click-through rate clusters have stand-alone apps that display ads, while only three of the twelve (25%) publishers in the larger cluster do. For mobile devices, it has been observed that in-app advertisements have higher click-through rates and browser-based ads (Hof, 2014). We conjecture that this is also true for personal computer apps, which is consistent with the clustering results. Thus it again appears that the clusters identified by CoCo also make sense practically.

Discussion
In this paper, we formulated and studied the problem of co-clustering of tensors as a convex optimization problem. The resulting CoCo estimator enjoys features in theory and practice that are arguably lacking in existing alternatives, namely statistical consistency, stability guarantees, and an algorithm with polynomial computational complexity. Through a battery of simulations, we observed that the CoCo estimator can identify co-clustering structures under realistic scenarios such as imbalanced co-cluster sizes, imbalanced number of clusters along each mode, heteroskedasticity in the noise distribution associated with each co-cluster, and even some violation of the checkerbox mean tensor assumption.
We have leveraged the power of the convex relaxation to engineer a computationally tractable co-clustering method that comes with statistical guarantees. These benefits, however, do not come for free. The CoCo estimator incurs similar costs that using the lasso incurs as a surrogate for a cardinality constraint or penalty. It is well known that the lasso leads to parameter estimates that are shrunk towards zero. This shrinkage toward zero is the price for simultaneously estimating the support, or locations of the nonzero entries, in a sparse vector as well as the values of the nonzero entries. In the context of convex co-clustering, the CoCo estimator U is shrunk towards the tensor X, namely the tensor whose entries are all equal to the average over all entries of X. The weights, however, play a critical role in reducing this bias. In fact, the weights can be seen as serving the same role as weights used in the adaptive lasso (Zou, 2006).
There are several possible extensions and open problems that have been left for future work. First, we note that there is a gap between what our theory predicts and what seems possible from our experiments. Specifically, Theorem 9 assumes uniform weights for each mode, yet simulation experiments indicate that the CoCo estimator using Tucker derived Gaussian kernel weights (9) can significantly outperform the CoCo estimator using uniform weights. One open problem is to derive prediction error bounds that relax the uniform weights assumption.
Second, although we have developed automatic methods for constructing the weights that work well empirically, other approaches to constructing the weights is a direction of future search. For example, other tensor approximation methods, such as the use of the ℓ 1 -norm to make the decomposition most robust to heavy tail noise as done by Cao et al. (2015), could possibly improve the quality of the weights.
Third, in this paper we have focused on additive noise that is a zero-mean M-concentrated random variable. Real data, however, may not follow such a distribution motivating coclustering procedures that can handle outliers. To address potential robustness issues, the CoCo framework could be extended to handle outliers by swapping the sum of squared residuals term in (5) with an analogous Huber loss or Tukey's Biweight function.
Finally, while our first order algorithm for co-clustering tensors scales linearly in the size of the data, data tensors inevitably will only increase in size motivating the need for more scalable algorithms for computing the CoCo estimator. A natural approach would be to adopt an existing distributed version of the proximal methods, such as one the methods proposed by Pesquet (2011), Chen andOzdaglar (2012), Li et al. (2013), or Eckstein (2017. Another natural approach would be to investigate if stochastic versions of the recently proposed generalized dual gradient ascent (Ho et al., 2019) could be adapted to compute the CoCo estimator. Additionally, in practice many data tensors that we would like to co-cluster may be very sparse. The first order algorithm presented here assumes the data tensor is dense. Consequently, an important direction of future work is to investigate alternative optimization algorithms that could leverage the sparsity structure within a data tensor.
We review two basic tensor decompositions that generalize the singular value decomposition (SVD) of a matrix: (i) the CANDECOMP/PARAFAC (CP) decomposition (Carroll and Chang, 1970;Harshman, 1970) and (ii) the Tucker decomposition (Tucker, 1966). Just as the SVD can be used to construct a lower-dimensional approximation to a data matrix, these two decompositions can be used to construct a lower dimensional approximation to a D-way tensor X ∈ ℝ n 1 × n 2 × ⋯ × n D The CP decomposition aims to approximate X by a sum of rank-one tensors, namely where • represents the outer product and a i (d) is the ith column of the dth factor matrix The positive integer R denotes the rank of the approximation. For sufficiently large R, we can exactly represent X with a CP decomposition.
The Tucker decomposition aims to approximate X by a core tensor H ∈ ℝ R 1 × R 2 × ⋯ × R D multiplied by factor matrices along each of its modes, namely We check to see if the solution U is continuous in the variable ζ = (x T , w T ) T . It is easy to verify that the following function is jointly continuous in U and ζ is a convex function of U that is continuous in (U, w). Let U ⋆ (ζ) = arg min U f(U, ζ) .
Since f(U, ζ) is strongly convex in U, the minimizer U ⋆ (ζ) exists and is unique.
We proceed with a proof by contradiction. Suppose U ⋆ (ζ) is not continuous at a point ζ. Then there exists an ϵ > 0 and a sequence {ζ (m) } converging to a limit ζ such that Since f(U, ζ) is strongly convex in U, the minimizer U (m) exists and is unique. Without loss of generality, we can assume ∥ζ (m) -ζ∥ F ≤ 1. This fact will be used later in proving the boundedness of the sequence U (m) .
If U (m) is a bounded sequence, then we can pass to a convergent subsequence with limit U.
Fix an arbitrary point U. Note that f U (m) , ζ (m) ≤ f U, ζ (m) for all m. Since f is continuous in (U, ζ), taking limits gives us the inequality f(U, ζ) ≤ f(U, ζ) .
Since U was selected arbitrarily, it follows that U = U ⋆ (ζ), which is a contradiction. It only remains for us to show that the sequence U (m) is bounded.
Consider the function Note that g is convex, since it is the point-wise supremum of a collection of convex functions. Since f U, ζ (m) ≤ g(U) and f is strongly convex in U, it follows that g(U) is also strongly convex and therefore has a unique global minimizer U* such that g U* . It also follows that for all m. By the reverse triangle inequality it follows that Combining the inequalities in (14) and (15), we arrive at the conclusion that for all m. Suppose the sequence U (m) is unbounded, namely ‖U (m) ‖ F ∞. But since X (m) converges to X, the left hand side must diverge. Thus, we arrive at a contradiction if U (m) is unbounded. We can then conclude that

B.2 Proof of Proposition 5
It follows that e i T U (d) = e j T U (d) , since w is positive. Since the pair (i, j) is arbitrary, it follows that all the rows of U (d) are identical or in other words, U (d) = 1c T for some c ∈ ℝ n −d . ■

B.3 Proof of Proposition 6
We will show that there is a γ max such that for all γ ≥ γ max , the grand mean tensor X is the unique global minimizer to the primal objective (4). We will certify that X is the solution to the primal problem by showing that the optimal value of a dual problem, which lower bounds the primal, equals F γ (X).
For sufficiently large γ, the solution to the dual maximization problem coincides with the solution to the unconstrained maximization problem max λ − 1 2 ‖A T λ‖ 2 2 + λ, Ax , whose solution is λ* = (AA T ) † Ax. Plugging λ* into the dual objective gives an optimal value of 1 2 ‖A T AA T † Ax‖ 2 2 = 1 2 ‖x − I − A T AA T † A x‖ 2 2 .
Note that [I − A T (AA T ) † A] is the projection onto the orthogonal complement of the column space of A T , which is equivalent to the null space or kernel of A, denoted Ker(A). We will show below that Ker(A) is the span of the all ones vector. Consequently, We now argue that Ker(A) is the span of 1 ∈ ℝ n . We rely on the following fact: If Φ d is an incidence matrix of a connected graph with n d vertices, then the rank of Φ d is n d − 1 (Deo, 1974, Theorem 7.2 Recall that the rank of the Kronecker product A⊗B is the product of the ranks of the matrices A and B. This rank property of Kronecker products of matrices implies that the dimension of Ker(A d ) equals n −d . Let b i = 1 n D ⊗ ⋯ ⊗ 1 n d + 1 ⊗ e i ⊗ 1 n d − 1 ⊗ ⋯ ⊗ 1 n 1 where 1 p ∈ ℝ p is the vector of all ones and e i ∈ ℝ n d is the ith standard basis vector. Then that the set of vectors B = b 1 , b 2 , …, b n d forms a basis for Ker(A d ).
Take an arbitrary element from Ker(A d ), namely a vector of the form 1 n′ ⊗ a ⊗ 1 n″ , where n′ = ∏ j = d + 1 D n j and n″ = ∏ j = 1 d − 1 . We will show that in order for 1 n′ ⊗ a ⊗ 1 d″ ∈ Ker(I ⊗ Φ d ), a must be a multiple of 1 n d . Consider the relevant matrix-vector product A d 1 n D ⊗ a ⊗ 1 n 1 = 1 n D ⊗ ⋯ ⊗ 1 n d + 1 ⊗ Φ d a ⊗ 1 n d − 1 ⊗ ⋯ ⊗ 1 n 1 .
Therefore, A d (1 n′ ⊗ a ⊗ 1 d″ ) = 0 if and only if Φ d a = 0. But the only way for Φ d a to be zero is for a = c1 n d for some c ∈ ℝ. Thus, Ker(A) is the span of 1 n .

B.4 Proof of Proposition 7
Note that U is the proximal mapping of the closed, convex function Then U is firmly nonexpansive in X (Combettes and Wajs, 2005, Lemma 2.4). Finally, firmly nonexpansive mappings are nonexpansive, which completes the proof. ■

Appendix C.: Proof of Theorem 9
We first prove some auxiliary lemmas before proving our prediction error result.

C.1 Auxiliary Lemmas
The following lemma considers the concentration of a random quadratic form y T By for a M-concentrated random vector y and a deterministic matrix B (Vu and Wang, 2015). It can be viewed as a generalization of the standard Hanson and Wright inequality for the quadratic forms of independent sub-Gaussian random variables (Hanson and Wright, 1971).
for any d = 1, … ,D. This fact together with the definition of G d = U d Λ d in (17) imply that solving our convex tensor clustering in (18)  Note that ‖x − u‖ 2 2 − ‖x − u*‖ 2 2 = ‖u‖ 2 2 − ‖u*‖ 2 2 − 2x T u − u* = ‖u − u*‖ 2 2 + 2ϵ T u − u* , where the last equality is due to the model assumption x = u* + ϵ Therefore, we have Next we derive the bound for f α d , β d . Note that the optimization over α d in (20) has a closed-form since the penalty term is independent of α d . In particular, by setting Plugging the results in (24) and (25) into (23), we obtain that, for each d = 1, … ,D where C 0 is a constant upper bound for the entries of U*. Combining the inequalities in (26) and (27)  Dividing both sides by n gives to the prediction error bound in (7). This ends the proof of Theorem 9. ■ where u = vec(U) = vec U (1) , namely the column-major vectorization of the mode-1 matricization of the tensor U. So, Note that Y = U × d A is equivalent to Y (d) = AU (d) .
We rewrite the penalty function R d as follows. The Lagrangian dual objective is given by G(λ) by minimizing the Lagrangian L(u, v, λ) over the primal variables u and v, namely where ι C d, l is the indicator function of the closed convex set C d,l = {z : ∥z∥ 2 ≤ γw d,l }.
The last equality in (28) follows from the fact that the Fenchel conjugate of a norm is the indicator function of the unit dual norm ball. Recall that the Fenchel conjugate f* of a function f is given by Let B = {λ : ∥λ∥ 2 ≤ 1} denote the unit ℓ 2 -norm ball. Since the ℓ 2 -norm is self dual, we arrive at the identity 2017). Consequently, we employ sparse weights. Specifically, we keep positive weights between approximately nearest-neighbor mode-d subarrays so that |E d | is O n d . By using these sparse weights, the per-iteration and storage costs scale more reasonably as O Dn , namely linearly in either the number of dimensions D or in the number of elements n.
Details on our weights choices are elaborated in Section 6.

E.2 Convergence
The sequence of dual iterates λ (m) is guaranteed to converge to a solution λ of (8) provided that the step-size parameter η is less than twice the reciprocal of the spectral radius of the matrix A T A (Combettes and Wajs, 2005, Theorem 3.4). Consequently, the sequence of primal iterates u (m) is guaranteed to converge to the CoCo estimator û. We note that under the same step-size conditions, convergence of the sequence u (m) can also be guaranteed by observing that the projected gradient algorithm applied to the dual problem (8) is an example of the alternating minimization algorithm (Tseng, 1991, Proposition 2).

E.3 Monitoring Convergence via the Duality Gap
Recall that we can bound the suboptimality of the mth iterate, F γ (u (m) ) − F γ (û), by the duality gap F γ (u (m) ) − G(λ (m) ), which can be expressed solely in terms of the mth iterate of the primal variable u (m) , namely For any optimal dual solution λ, the gap vanishes, namely F γ (u) = G(λ). Note that computing the duality gap incurs minimal additional cost as u (m) and A d,l u (m) are already computed as part of the gradient step. In short, including a duality gap computation will not change the O Dn per-iteration cost of the projected gradient algorithm. In practice, we can terminate the algorithm once the duality gap falls below some small tolerance.

E.4 Computing Mode-d Difference Variables
In Section 7.2, we explained how clustering assignments along the dth mode are made using the mode-d difference variables v d, l = U × d Δ d, l . In practice we must deal with the fact that the û recovered by computing x − A T λ may exhibit a nearly but not exactly checkerbox structure due to limitations in numerical precision. This creates a practical issue as a small but non-zero difference variable will lead to an incorrect clustering assignment. Addressing this issue, however, is simple. The projected gradient algorithm used to compute CoCo is a natural generalization of the projected gradient algorithm used in Chi and Lange (2015) for convex clustering. Consequently, we can use the obvious adaptation of the procedure for computing the differences variables in convex clustering. The following brief technical discussion is expanded in more detail in Chi and Lange (2015).
The key fact that we use is that the projected gradient algorithm is equivalent to the alternating minimization algorithm (AMA) applied to the following augmented Lagrangian function The mode-d difference vector v d,l is determined by the proximal map v d, l = arg min v d, l 1 2 ‖v d, l − A d, l u − η −1 λ d, l ‖ 2 2 + γw d, l η ‖v d, l ‖ 2 = prox σ d, l ‖ ⋅ ‖ 2 A d, l u − η −1 λ d, l , where σ d,l = γw d,l /η. Because the proximal mapping can produce mode-d difference variables that are exactly zero, the procedure for computing v d,l in (31) is immune to the numerical precision issues that hinder the direct computation U × d Δ d, l .

Setting Weights
Employing the Tucker decomposition introduces another tuning parameter, namely the rank of the decomposition. When applicable, a user can leverage problem-specific knowledge to select the rank for the decomposition. Nonetheless, the availability of an automatic approach is desirable to handle cases when such knowledge is unavailable. Selecting the rank in a tensor decomposition, however, is an open question (Kolda and Bader, 2009;Yokota et al., 2017). During initial experiments, a few different methods for selecting the Tucker decomposition rank from the literature were compared: an L-curve approach that attempts to strike a balance between the decomposition's relative error and compression ratio, as implemented by the mlrankest function in the Tensorlab MATLAB toolbox (Vervliet et al., 2016), minimum description length (Rissanen, 1978;Yokota et al., 2017), and the recentlyproposed SCORE algorithm (Yokota et al., 2017). Out of these, the SCORE algorithm produced the best average CoCo estimator performance. The SCORE algorithm itself includes a tuning parameter, ρ, and Yokota et al. (2017) suggest setting ρ ∈ 10 −4 , 10 −2 . We considered ρ ∈ 10 −4 , 10 −3 , 10 −2 and found 10 −3 to perform the best, which also matches the value used in the experiments by Yokota et al. (2017).
We also developed a simple yet effective heuristic for choosing the rank where we set the Tucker rank for the dth mode to be the floor of n d /2. Two principles motivating the heuristic are that the rank of the decomposition should be both small relative to and also in proportion to the length of the modes. Both the SCORE algorithm and our heuristic were employed in our simulations described in Section 8 as a robustness check to ensure our CoCo estimator's performance does not crucially depend on the choice of the rank.
The basic Tucker decomposition computation is accomplished by the higher order SVD (HOSVD) method (De Lathauwer et al., 2000) which computes for each mdoe k the r k leading left singular values of the mode-k matricization and stores them as a factor matrix U k . The HOSVD then computes the core tensor by contracting the data tensor X × k U k .
Thus, the main cost is computing D SVDs. This is an illustrative calculation, however, and more efficient alternatives exist (Vannieuwenhoven et al., 2012;Minster et al., 2020).

Appendix G.: CPD+k-means
We describe in greater detail the CPD+k-means method for co-clustering a D-way tensor X ∈ ℝ n 1 × ⋯ × n D . The method consists of two steps Step 1. Compute a rank-R CP decomposition where • represents the outer product and a i (d) is the ith column of the dth factor matrix A (d) ∈ ℝ n d × R .
Step 2. For each factor matrices A (d) , apply k-means clustering on the n d rows of A (d) . Note that the D applications of k-means are done independently for each mode-d factor matrix A (d) .
Tuning parameters: There are two sets of tuning parameters: (i) the rank parameter R, used in Step 1, and (ii) the D cluster number parameters for each factor matrix, used in Step 2.
To choose the rank parameter R, we create a candidate set of ranks R candidate ⊂ 1, 2, 3, … and select R ⋆ ∈ R candidate using the tuning procedure in Sun et al. (2017). We then compute a CP decomposition using the selected rank R* and obtain the factor matrices A (d) for d = 1, … ,D. To choose the D cluster number parameters, we create D candidate sets of cluster numbers K candidate d ⊂ 1, 2, 3, …, n d and select k d ⋆ ∈ K candidate d for d = 1, … ,D using the gap statistic procedure in Tibshirani et al. (2001). We use the D clustering results from running k-means on the rows of each of the A (d) using k d ⋆ .

Appendix H.: Additional Simulations on Rectangular Tensors
The first rectangular tensor is one in which there are two short modes (n 1 = n 2 = 10) and one relatively longer mode (n 3 = 50). Figure 14 presents the clustering results for this tensor shape.
At a lower noise level (σ = 2), CoCo performs very well and outperforms CPD+kmeans and CoTeC in terms of both single-mode clustering and co-clustering. When the noise level is bumped up (σ = 3), both methods experience a noticeable drop off in their performance and now perform more similarly. Interestingly, CoCo's single-mode clustering results are better along the two shorter modes (modes 1 and 2), which is not what we expected. This provides some evidence that the performance along a mode depends on both the length of that mode as well as the lengths of the other modes. When the length of the shorter modes are increased slightly (from n d = 10 to n d = 20 for d = 1, 2), CoCo has near-perfect performance while CPD+k-means performs roughly the same as before. Thus, CoCo struggles with this tensor shape only when the short modes are really short (only 10 observations).
To further investigate the mode-by-mode performance with rectangular tensors, we also apply the clustering methods to a "Goldilocks" tensor with mode lengths that are short, medium, and long. This setting was again motivated by the results from the previous two tensor shapes to see how the performance is impacted when the size of a longer mode is increased. The ARI results for this tensor shape are given in Figure 15d, and they are consistent with what was observed previously. When the short mode has only 10 observations, CoCo initially performs very well until the noise reaches a certain level. At this point, its performance for the longer modes declines sharply and actually performs worse than CPD+k-means, and this pattern is more pronounced for the longest mode (n 3 = 100). The overall co-clustering performance for both methods remains similar, however. As before, CoCo does not experience as much of a decrease when the shortest mode is made slightly longer (n 1 = 20), and does noticeably better than CPD+k-means for the most part.
Overall, from clustering these different tensor shapes we see that CoCo still generally performs very well and better than CPD+k-means. The main issue it encounters is when at least one mode is very short (n d = 10). CoCo performs very well a lower noise levels but has a sharp decline in performance once the noise reaches a certain level. Unexpectedly, the decline in single-mode performance is worse for the longer modes. However, even when this happens, CoCo's overall co-clustering performance is still comparable to CPD+k-means. Additionally, this pattern is much less striking when the length of the shortest mode is increased slightly.